Information Flow in Social Groups 
We present a study of information flow that takes into account the observation that an item 
relevant to one person is more likely to be of interest to individuals in the same social circle than 
those outside of it. This is due to the fact that the similarity of node attributes in social networks 
decreases as a function of the graph distance. An epidemic model on a scale-free network with 
this property has a finite threshold, implying that the spread of information is limited. We tested 
our predictions by measuring the spread of messages in an organization and also by numerical 
experiments that take into consideration the organizational distance among individuals. 



The problem of information flow in social organizations is relevant to issues of
productivity, innovation and the sorting of useful ideas from the general chatter
of a community. How information spreads determines the speed with which individuals
can act and plan their future activities. In particular, email has become the
predominant means of communication in the information society. It pervades business,
social and scientific exchanges, and as such it is a highly relevant area for research
on communities and social networks. Not surprisingly, email has been established as an
indicator of collaboration and knowledge exchange. Email is also a good medium for
research because it provides plentiful data on personal communication in electronic form.

Since individuals tend to organize both formally and 
informally into groups based on their common activities 
and interests, the way information spreads is affected by 
the topology of the interaction network, not unlike the 
spread of a disease among individuals. Thus one would 
expect that epidemic models on graphs are relevant to 
the study of information flow in organizations. In particular, recent work on
epidemic propagation on scale-free networks found that the threshold for an epidemic
is zero, implying that a finite fraction of the graph becomes infected for arbitrarily
low transmission probabilities. The presence of additional network structure was found
to further influence the spread of disease on scale-free graphs.

There are, however, differences between information 
flows and the spread of viruses. While viruses tend to 
be indiscriminate, infecting any susceptible individual, 
information is selective and passed by its host only to 
individuals the host thinks would be interested in it. 

The information any individual is interested in depends strongly on their
characteristics. Furthermore, individuals with similar characteristics tend to
associate with one another, a phenomenon known as homophily. Conversely, individuals
many steps removed in a social network tend, on average, not to have as much in
common, as shown in a study [10] of a network of Stanford student homepages and
illustrated in Figure 1.

We therefore introduce an epidemic model with decay in the transmission probability
of a particular piece of information as a function of the distance between the
originating source and the current potential target. In the following analysis, we
show that this epidemic model on a scale-free network has a finite threshold, implying
that the spread of information is limited. We further tested our predictions by
observing the prevalence of messages in an organization and also by numerical
experiments that take into consideration the organizational distance among individuals.

FIG. 1: Average similarity of Stanford student homepages as a function of the number
of hyperlinks separating them.

Consider the problem of information transmission in a power-law network whose degree
distribution is given by

$$p_k = C k^{-\alpha}, \tag{1}$$

where $\alpha > 1$ and $C$ is determined by the normalization condition. The generating
function of the distribution is

$$G_0(x) = \sum_{k=1}^{\infty} p_k x^k. \tag{2}$$

Following the analysis in [7] for the SIR (susceptible, infected, removed) model, we now
estimate the probability $p_m^{(1)}$ that the first person in the community who has
received a piece of information will transmit it to $m$ of their neighbors. Using the
binomial distribution, we find

$$p_m^{(1)} = \sum_{k=m}^{\infty} p_k \binom{k}{m} T^m (1-T)^{k-m}, \tag{3}$$

where the transmissibility $T$ is the probability that a person will transmit an item to
a neighbor and the superscript "(1)" refers to first neighbors, those who received the
information directly from the initial source. The generating function for $p_m^{(1)}$
is given by

$$G^{(1)}(x) = \sum_{m=0}^{\infty} \sum_{k=m}^{\infty} p_k \binom{k}{m} T^m (1-T)^{k-m} x^m \tag{4}$$

$$= G_0\left(1 + (x-1)T\right) = G_0(x; T). \tag{5}$$
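The identity between the double sum in (4) and the compact form in (5) can be checked numerically. The following is a minimal sketch, assuming a power-law distribution truncated at a hypothetical maximum degree `k_max` so that all sums are finite; the function names and the values of `alpha`, `T` and `x` are illustrative choices, not taken from the analysis above.

```python
from math import comb


def degree_distribution(alpha, k_max):
    """p_k = C k^(-alpha) for k = 1..k_max (the truncation at k_max is an assumption)."""
    weights = {k: k ** (-alpha) for k in range(1, k_max + 1)}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


def g0(p, x):
    """Generating function G_0(x) = sum_k p_k x^k."""
    return sum(pk * x ** k for k, pk in p.items())


def first_neighbor_distribution(p, T):
    """p_m^(1) = sum_{k >= m} p_k C(k, m) T^m (1 - T)^(k - m), for m = 0..k_max."""
    k_max = max(p)
    return [
        sum(pk * comb(k, m) * T ** m * (1 - T) ** (k - m)
            for k, pk in p.items() if k >= m)
        for m in range(k_max + 1)
    ]


def g_first(p, T, x):
    """G^(1)(x) evaluated directly from the distribution p_m^(1)."""
    pm = first_neighbor_distribution(p, T)
    return sum(q * x ** m for m, q in enumerate(pm))


# Illustrative parameter values only.
p = degree_distribution(alpha=2.5, k_max=60)
T, x = 0.3, 0.7
lhs = g_first(p, T, x)            # double sum, as in (4)
rhs = g0(p, 1 + (x - 1) * T)      # compact form, as in (5)
print(abs(lhs - rhs))             # difference at the level of rounding error
```

The agreement follows from the binomial theorem applied to each degree $k$ separately, so it does not depend on the particular form of $p_k$.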



Suppose the transmissibility decays as a power of the distance from the initial source.
We choose this weakest form of decay because the results obtained from it will also be
valid for stronger functional forms. Then the probability that an $m$th neighbor will
transmit the information to a person with whom he has contact is given by

$$T^{(m)} = (m+1)^{-\beta}\, T, \tag{6}$$

where $\beta > 0$ is the decay constant. $T^{(m)} = T$ at the originating node ($m = 0$)
and decays to zero as $m \to \infty$.

The distribution of the number of 2nd neighbors can be written as

$$G^{(2)}(x) = \sum_k p_k^{(1)} \left[G_1^{(1)}(x)\right]^k = G^{(1)}\left(G_1^{(1)}(x)\right), \tag{7}$$

where

$$G_1^{(1)}(x) = G_1(x; 2^{-\beta}T) = G_1\left(1 + (x-1)\,2^{-\beta}T\right). \tag{8}$$

Similarly, if we define $G^{(m)}(x)$ to be the generating function for the number of
$m$th neighbors affected, then we have

$$G^{(m+1)}(x) = G^{(m)}\left(G_1^{(m)}(x)\right) \quad \text{for } m \ge 1, \tag{9}$$

where

$$G_1^{(m)}(x) = G_1(x; (m+1)^{-\beta}T) = G_1\left(1 + (x-1)(m+1)^{-\beta}T\right) \tag{10}$$

and

$$G_1(x) = \frac{G_0'(x)}{G_0'(1)} = \frac{1}{z}\,G_0'(x). \tag{11}$$

Or, more explicitly,

$$G^{(m+1)}(x) = G^{(1)}\left(G_1^{(1)}\left(G_1^{(2)}\left(\cdots G_1^{(m)}(x)\right)\right)\right). \tag{12}$$

The average number $z_{m+1}$ of $(m+1)$th neighbors is

$$z_{m+1} = G^{(m+1)\prime}(1) = G_1^{(m)\prime}(1)\, G^{(m)\prime}(1) = G_1^{(m)\prime}(1)\, z_m. \tag{13}$$
FIG. 2: $T_c$ as a function of $\alpha$. The three different curves, from bottom to top,
are: 1) no decay in transmission probability, no exponential cutoff in the degree
distribution ($\kappa = \infty$, $\beta = 0$); 2) $\kappa = 100$, $\beta = 0$;
3) $\kappa = 100$, $\beta = 1$.



So the condition that the size of the outbreak remains finite is given by

$$\frac{z_{m+1}}{z_m} = G_1^{(m)\prime}(1) < 1, \tag{14}$$

or

$$(m+1)^{-\beta}\, T\, G_1'(1) < 1. \tag{15}$$

For any given $T$, the left-hand side of the inequality above goes to zero as
$m \to \infty$, so the condition is eventually satisfied for large $m$. Therefore the
average total size

$$\langle s \rangle = \sum_{m=1}^{\infty} z_m \tag{16}$$

is always finite if the transmissibility decays with distance. Note that if $T$ is
constant the average total size is infinite for values of $\alpha < 3$, as shown
previously.
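As a worked illustration, the recursion for $z_m$ can be iterated explicitly. This sketch assumes that $G_1'(1)$ is finite (which holds, for instance, when the degree distribution has a cutoff) and starts from $z_1 = G^{(1)\prime}(1) = T\,G_0'(1) = Tz$; the average total size $\langle s \rangle$ is taken to be the sum of the $z_m$ over all $m \ge 1$.

```latex
z_{m} = z_1 \prod_{j=1}^{m-1} (j+1)^{-\beta}\, T\, G_1'(1)
      = T z\, \frac{\left[T\, G_1'(1)\right]^{m-1}}{(m!)^{\beta}},
\qquad
\langle s \rangle = \sum_{m=1}^{\infty} z_m
      = T z \sum_{m=1}^{\infty} \frac{\left[T\, G_1'(1)\right]^{m-1}}{(m!)^{\beta}} .

% Special case \beta = 1: the sum is an exponential series.
\langle s \rangle \Big|_{\beta = 1}
      = \frac{z}{G_1'(1)} \left( e^{\,T\, G_1'(1)} - 1 \right).
```

The factorial in the denominator is what makes the series converge for every $T$ when $\beta > 0$, whereas for $\beta = 0$ it reduces to a geometric series that diverges once $T\,G_1'(1) \ge 1$.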

In the real world, however, the size of a network is always finite, and in order to
define a transmissibility threshold one needs an outbreak size that is compatible with
the size of the whole network. Furthermore, many real-world networks have a cutoff
$\kappa$ far below their size. Thus we can write for the link distribution
$p_k = C k^{-\alpha} \exp(-k/\kappa)$.

As an example, consider a network made up of $10^6$ vertices. We define an epidemic to
be an outbreak affecting more than 1%, or $10^4$, of the vertices. Thus for fixed
$\alpha$, $\kappa$ and $\beta$, we can define $T_c$ as the transmissibility above which
$\langle s \rangle$ exceeds $10^4$.
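The definition of $T_c$ above can be turned into a short numerical procedure. The sketch below is one possible way to do it, not necessarily how the published curves were produced: it assumes the degree sums can be truncated at a hypothetical `k_max`, sums the average number of $m$th neighbors up to a hypothetical `m_max`, and locates $T_c$ by bisection on $T$ in $[0, 1]$. All function and parameter names are illustrative.

```python
import math


def degree_moments(alpha, kappa, k_max=5000):
    """Return (<k>, <k^2>) for p_k = C k^(-alpha) exp(-k/kappa), summed up to k_max."""
    norm = mean = second = 0.0
    for k in range(1, k_max + 1):
        w = k ** (-alpha) * math.exp(-k / kappa)
        norm += w
        mean += k * w
        second += k * k * w
    return mean / norm, second / norm


def average_outbreak_size(T, z, g1_prime, beta, cap=1e4, m_max=500):
    """Sum z_m over m, with z_1 = T z and z_{m+1} = (m+1)^(-beta) T G_1'(1) z_m.

    The sum stops early once it exceeds `cap`; m_max is a truncation assumption.
    """
    z_m = T * z
    total = z_m
    for m in range(1, m_max):
        z_m *= (m + 1) ** (-beta) * T * g1_prime
        total += z_m
        if total > cap:
            break
    return total


def critical_transmissibility(alpha, kappa, beta, cap=1e4, tol=1e-4):
    """Smallest T in [0, 1] at which the average outbreak size exceeds `cap`, or None."""
    k1, k2 = degree_moments(alpha, kappa)
    z = k1                        # z = G_0'(1) = <k>
    g1_prime = (k2 - k1) / k1     # G_1'(1) = G_0''(1) / G_0'(1)
    if average_outbreak_size(1.0, z, g1_prime, beta, cap) <= cap:
        return None               # no epidemic even at T = 1
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if average_outbreak_size(mid, z, g1_prime, beta, cap) > cap:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


tc_decay = critical_transmissibility(alpha=2.0, kappa=100.0, beta=1.0)
tc_no_decay = critical_transmissibility(alpha=2.0, kappa=100.0, beta=0.0)
print(tc_decay, tc_no_decay)
```

Bisection is applicable because every $z_m$ grows monotonically with $T$. For $\beta = 0$ the finite `m_max` slightly overestimates the threshold, since the geometric series is cut off before it fully diverges.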

The numerical result for $T_c$ as a function of $\alpha$ is shown in Fig. 2, where we
choose $\kappa = 100$ and $\beta = 1$. It is seen that when there is no decay, $T_c$ is
very near zero for $\alpha$ close to 2, which means that epidemics occur for most
values of $T$. However, when the transmissibility decays, $T_c$ rises substantially.
For example, $T_c$ jumps to 0.54 at $\alpha = 2$, implying that the information may not
spread over the network.

FIG. 3: Number of people receiving URLs and attachments

In order to validate empirically that the spread of information within a network of
people is limited, and hence distinct from the spread of a virus, we gathered a sample
from the mail clients of 40 individuals (30 within HP Labs, and 10 from other areas of
HP, other research labs, and universities). Each volunteer executed a program that
identified URLs and attachments in the messages in their mailboxes, as well as the time
the messages were received. These data were cryptographically hashed to protect the
privacy of the users. By analyzing the message content and headers, we restricted our
data to include only messages that had been forwarded at least once, thereby eliminating
most postings to mailing lists and more closely approximating true interpersonal
information spreading behavior. The median number of messages in a mailbox in our sample
is 2200, indicating that many users keep a substantial portion of their email
correspondence. Although some messages may have been lost when users deleted them, we
assume that a majority of messages containing useful information had been retained.
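The collection step described above, hashing each URL or attachment and counting how many mailboxes contain it, might look roughly like the following. This is a sketch under stated assumptions: the text does not say which hash function was used, so SHA-256 is a stand-in, and the mailbox data and all names below are invented for illustration.

```python
import hashlib
from collections import Counter, defaultdict


def hash_item(item):
    """SHA-256 digest of a URL or attachment identifier (the algorithm is an assumption)."""
    return hashlib.sha256(item.encode("utf-8")).hexdigest()


def recipients_per_item(mailboxes):
    """Map each hashed item to the set of mailbox owners who received it.

    mailboxes: dict mapping owner -> iterable of items found in forwarded messages.
    """
    seen = defaultdict(set)
    for owner, items in mailboxes.items():
        for item in items:
            seen[hash_item(item)].add(owner)
    return seen


def reach_histogram(mailboxes):
    """Counter mapping 'number of recipients' -> 'number of items with that reach'."""
    return Counter(len(owners) for owners in recipients_per_item(mailboxes).values())


# Toy data, invented for illustration only.
toy_mailboxes = {
    "user_a": ["http://example.com/a", "http://example.com/b"],
    "user_b": ["http://example.com/a"],
    "user_c": ["http://example.com/a", "http://example.com/c"],
}
histogram = reach_histogram(toy_mailboxes)
print(sorted(histogram.items()))
```

Because only digests are compared, two mailboxes can be matched on a shared item without either one revealing the item itself.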

Figure 3 shows a histogram of how many users had received each of the 3401 attachments
and 6370 URLs. The distribution shows that only a small fraction (5% of attachments and
10% of URLs) reached more than 1 recipient. Very few (41 URLs and 6 attachments) reached
more than 5 individuals, a number which, in a sample of 40, starts to resemble an
outbreak. In follow-up discussions with our study subjects, we were able to identify the
content and significance of most of these messages. Fourteen of the URLs were
advertisements attached to the bottom of an email by free email services such as Yahoo
and MSN. These are in a sense viral, because the sender is sending them involuntarily.
It is this viral strategy that was responsible for the rapid buildup of the Hotmail free
email service user base. Ten URLs pointed to internal HP project or personal pages,
3 URLs were for external commercial or personal sites, and the remaining 14 could not
be identified.

FIG. 4: Outdegree distribution for all senders (224,514 in total) sending email to or
from the HP Labs email server over the course of 3 months. The outdegree of a node is
the number of correspondents the node sent email to.

In our sample, one group is overrepresented, allowing 
us to observe both the spread of information within a 
close group, and the lack of information spread across 
groups. A number of attachments reaching four or more 
people were resumes circulated within one group. A few 
attachments were announcements passed down by higher-level management. This kind of
top-down transmission within an organization is another path through which
information can be efficiently disseminated.

Next we simulated the effect of decay in the transmission probability on the email
graph at HP Labs in Palo Alto, CA. The graph was constructed from recorded logs of all
incoming and outgoing messages over a period of 3 months. The graph has a nearly
power-law out-degree distribution, shown in Figure 4, including both internal and
external nodes. Because all of the outgoing and incoming contacts were recorded for
internal nodes, their in- and out-degrees were higher than for the external nodes, for
which we could only record the email they sent to and received from HP Labs. We
nevertheless considered a graph with the internal and external nodes mixed (as in [18])
to demonstrate the effect of a decay on the spread of email specifically in a power-law
graph.

FIG. 5: Average outbreak and epidemic size as a function of the transmission
probability $p$.

We simulated the spread of an epidemic by selecting a random initial sender to infect
and following the email log containing 120,000 entries involving over 7,000 recipients
in the course of a week. Every time an infective individual was recorded as sending an
email to someone else, they had a constant probability $p$ of infecting the recipient.
Hence individuals who email more often have a higher probability of infecting. We also
assume that an individual remains infective (willing to transmit a particular piece of
information) for a period of 24 hours.
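The simulation procedure just described can be sketched as a single pass over a time-ordered log. The version below is a minimal illustration under several assumptions: log entries are `(timestamp_in_seconds, sender, recipient)` tuples, the seed becomes infective at the time of its first logged message, and the toy log is invented. None of these details are specified in the text.

```python
import random

INFECTIVE_PERIOD = 24 * 3600  # seconds; the 24-hour window stated in the text


def simulate_outbreak(log, seed_sender, p, rng):
    """Follow a time-sorted log of (timestamp, sender, recipient) entries.

    Returns the set of individuals who hold the information at the end.
    Assumption: the seed becomes infective when it sends its first logged message.
    """
    first_send = next((t for t, s, _ in log if s == seed_sender), None)
    if first_send is None:
        return {seed_sender}
    infected_at = {seed_sender: first_send}
    for t, sender, recipient in log:
        start = infected_at.get(sender)
        if start is None or recipient in infected_at:
            continue  # sender not infective, or recipient already has the item
        if start <= t <= start + INFECTIVE_PERIOD and rng.random() < p:
            infected_at[recipient] = t
    return set(infected_at)


def mean_outbreak_size(log, p, trials, rng):
    """Average outbreak size over randomly chosen initial senders."""
    senders = sorted({s for _, s, _ in log})
    sizes = [len(simulate_outbreak(log, rng.choice(senders), p, rng))
             for _ in range(trials)]
    return sum(sizes) / trials


HOUR = 3600
# Toy log, invented for illustration only.
toy_log = [
    (0 * HOUR, "a", "b"),
    (1 * HOUR, "b", "c"),
    (30 * HOUR, "b", "d"),  # more than 24 h after b received the item
    (31 * HOUR, "c", "e"),  # more than 24 h after c received the item
]
reached = simulate_outbreak(toy_log, "a", 1.0, random.Random(0))
print(sorted(reached))
```

Because infection attempts happen per logged message, a sender who emails the same recipient several times within the window gets several chances, which is how frequent emailers end up with a higher probability of infecting.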

Next we introduced a decay in the transmission probability $p$ as a negative power of
$d_{ij}$, where $d_{ij}$ is the distance in the organizational hierarchy between two
individuals. The exponent roughly corresponds to the decay in similarity between
homepages shown in Figure 1. The decay represents the fact that individuals closer
together in the organizational hierarchy share more common interests. Individuals have
a distance of one to their immediate superiors and subordinates and to those they share
a superior with. The distance between someone within HP Labs and someone outside of
HP Labs was set to the maximum hierarchical distance of 8.
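One way to realize this distance rule is to build a graph in which each person is linked to their superior, their subordinates and the colleagues who share their superior, and then take shortest-path lengths. The sketch below makes that assumption explicit: the text only fixes the distance-one cases and the maximum of 8, so the breadth-first extension to larger distances, the `manager_of` input format, the toy hierarchy and the `exponent` value are all hypothetical.

```python
from collections import deque


def hierarchy_graph(manager_of):
    """Link each person to their superior, subordinates, and those sharing a superior.

    manager_of: dict mapping person -> superior (None for the top of the hierarchy).
    """
    neighbors = {}
    reports = {}
    for person, boss in manager_of.items():
        neighbors.setdefault(person, set())
        if boss is not None:
            neighbors.setdefault(boss, set())
            neighbors[person].add(boss)
            neighbors[boss].add(person)
            reports.setdefault(boss, []).append(person)
    for team in reports.values():
        for a in team:
            neighbors[a].update(b for b in team if b != a)
    return neighbors


def hierarchical_distance(neighbors, source, target, max_distance=8):
    """Shortest-path length by breadth-first search.

    People missing from the hierarchy (external nodes) get max_distance.
    """
    if source not in neighbors or target not in neighbors:
        return max_distance
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == target:
                    return min(dist[nxt], max_distance)
                queue.append(nxt)
    return max_distance


def decayed_probability(p, distance, exponent):
    """Transmission probability p * distance^(-exponent); exponent is a free parameter here."""
    return p * max(distance, 1) ** (-exponent)


# Toy hierarchy, invented for illustration only.
manager_of = {"ceo": None, "m1": "ceo", "m2": "ceo",
              "e1": "m1", "e2": "m1", "e3": "m2"}
graph = hierarchy_graph(manager_of)
d = hierarchical_distance(graph, "e1", "e3")
print(d, decayed_probability(0.8, d, exponent=1.0))
```

In a simulation of the kind described earlier, `decayed_probability` would replace the constant $p$ each time an infective sender emails a recipient.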

In Figure 5 we show the variation in the average outbreak size and the average
epidemic size (an epidemic being any outbreak affecting more than 30 individuals).
Without decay, the epidemic threshold falls below $p = 0.01$. With decay, the threshold
is set back to $p = 0.20$ and the epidemic size is limited to about 50 individuals,
even for $p = 1$.

As these results show, the decay of similarity among members of a social group has
strong implications for the propagation of information among them. In particular, the
number of individuals that a given email message reaches is very small, in contrast to
what one would expect on the basis of a virus epidemic model on a scale-free graph. The
implication of this finding is that merely discovering hubs in a community network is
not enough to ensure that information originating at a particular node will reach a
large fraction of the community. We expect that these findings also hold for other
means of social communication, such as verbal exchanges, telephony and instant
messenger systems.